
Thus Pr(I = ∅) ≥ 2/3, as required.

(d) This is clearly true if V = ∅. If V ≠ ∅ and v = max V ∈ I_0 then, by induction,

Pr(I = I_0) = (N_1/(N_1 + N_2)) · (φ(N_1 + N_2)/N_1) = φ,

and similarly Pr(I = I_0) = φ if v ∉ I_0.

(e) Let E denote the event that some output of approxcount is bad in the iteration that produces output. Then for A ⊆ Ω,

π̂(A) ≤ Pr(I ∈ A | Ē) + Pr(E) ≤ |A|/|Ω| + δ,

and similarly π̂(A) ≥ |A|/|Ω| − δ.

We have therefore shown that by running Ugen a constant expected number of times, we will with probability at least 1 − δ output a randomly chosen independent set. The expected running time of Ugen is clearly as given in (1.11), which is small enough to make it a good sampler.

Having dealt with a specific example, we now see how to put the above ideas into a formal framework. Before doing this we enumerate some basic facts about Markov chains.

1.3 Markov Chains

Throughout, N = {0, 1, 2, ...}, N_+ = N \ {0}, Q_+ = {q ∈ Q : q > 0}, and [n] = {1, 2, ..., n} for n ∈ N_+.

A Markov chain M on the finite state space Ω, with transition matrix P, is a sequence of random variables X_t, t = 0, 1, 2, ..., which satisfy

Pr(X_t = σ | X_{t−1} = ω, X_{t−2}, ..., X_0) = P(ω, σ)   (t = 1, 2, ...).

We sometimes write P(ω, σ) as P_{ωσ}. The value of X_t is referred to as the state of M at time t. Consider the digraph D_M = (Ω, A), where A = {(σ, ω) ∈ Ω × Ω : P(σ, ω) > 0}. We will by and large be concerned with chains that satisfy the following assumptions:

M1 The digraph D_M is strongly connected.

M2 gcd{|C| : C is a directed cycle of D_M} = 1.
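Together, M1 and M2 say that the chain is irreducible and aperiodic; for a finite chain this is equivalent to some power of P having all entries strictly positive. The following is a minimal numerical sketch of that check (illustrative, not from the notes; is_ergodic is an assumed name):

```python
import numpy as np

def is_ergodic(P, t_max=None):
    """Check M1/M2 numerically: an irreducible, aperiodic finite chain
    has some power P^t whose entries are all strictly positive."""
    n = P.shape[0]
    # Wielandt's bound: for a primitive n x n non-negative matrix,
    # t = (n - 1)^2 + 1 steps suffice, so we never need to look further.
    t_max = t_max or (n - 1) ** 2 + 1
    Q = np.eye(n)
    for _ in range(t_max):
        Q = Q @ P
        if (Q > 0).all():
            return True
    return False

# Two-state chain that always swaps: periodic, hence not ergodic.
P_periodic = np.array([[0.0, 1.0], [1.0, 0.0]])
# The same chain made lazy is ergodic.
P_lazy = 0.5 * np.eye(2) + 0.5 * P_periodic
print(is_ergodic(P_periodic), is_ergodic(P_lazy))  # False True
```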

Under these assumptions, M is ergodic and therefore has a unique stationary distribution π, i.e.

lim_{t→∞} Pr(X_t = ω | X_0 = σ) = π(ω),   (1.12)

i.e. the limit does not depend on the starting state X_0. Furthermore, π is the unique left eigenvector of P with eigenvalue 1, i.e. it satisfies

P^T π = π.   (1.13)

Another useful fact is that if τ_σ denotes the expected number of steps between successive visits to state σ, then

τ_σ = 1/π(σ).   (1.14)

In most cases of interest, M is reversible, i.e.

Q(ω, σ) = π(ω)P(ω, σ) = π(σ)P(σ, ω)   (∀ω, σ ∈ Ω).   (1.15)

The central role of reversible chains in applications rests on the fact that π can be deduced from (1.15). If µ : Ω → R satisfies (1.15), then it determines π up to normalization. Indeed, if (1.15) holds and Σ_{ω∈Ω} π(ω) = 1, then

Σ_{ω∈Ω} π(ω)P(ω, σ) = Σ_{ω∈Ω} π(σ)P(σ, ω) = π(σ),

which proves that π is a left eigenvector with eigenvalue 1. In fact, we often design the chain to satisfy (1.15). Without reversibility, there is no apparent method of determining π, other than to explicitly construct the transition matrix, an exponential time (and space) computation in our setting.

As a canonical example of a reversible chain we have a random walk on a graph. A random walk on the undirected graph G = (V, E) is a Markov chain with state space V associated with a particle that moves from vertex to vertex according to the following rule: the probability of a transition from vertex i, of degree d_i, to vertex j is 1/d_i if {i, j} ∈ E, and 0 otherwise. Its stationary distribution is given by

π(v) = d_v / (2|E|)   (v ∈ V).   (1.16)

To see this, note that Q(v, w) = Q(w, v) = 0 if v, w are not adjacent, and otherwise Q(v, w) = 1/(2|E|) = Q(w, v), verifying the detailed balance equations (1.15). Note that if G is a regular graph then the steady state is uniform over V.
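Both facts are easy to confirm numerically on a small graph. A minimal sketch (illustrative, assuming a small hand-picked graph):

```python
import numpy as np

# Random walk on a small non-regular graph: verify that pi(v) = d_v/(2|E|)
# from (1.16) is stationary and satisfies detailed balance (1.15).
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]   # triangle with a pendant vertex
n = 4
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
deg = A.sum(axis=1)
P = A / deg[:, None]          # P(i, j) = 1/d_i for each neighbour j of i
pi = deg / deg.sum()          # = d_v / (2|E|), since deg.sum() = 2|E|
print(np.allclose(pi @ P, pi))   # True: pi is a left eigenvector of P
Q = pi[:, None] * P              # Q(v, w) = pi(v) P(v, w)
print(np.allclose(Q, Q.T))       # True: detailed balance holds
```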

If G is bipartite then the walk as described is not ergodic, because all cycles are of even length. This is usually handled by adding d_v loops to vertex v, for each vertex v. (Each loop counts as a single exit from v.) The net effect of this is to make the particle stay put with probability 1/2 at each step. The steady state is unaffected. The chain is now lazy: a chain is lazy if P(ω, ω) ≥ 1/2 for all ω ∈ Ω.

If p_0(ω) = Pr(X_0 = ω), then p_t(σ) = Σ_ω p_0(ω)P^t(ω, σ) is the distribution at time t. As a measure of convergence, the natural choice in this context is variation distance. The mixing time of the chain is then

τ(ε) = max_{p_0} min{t : D_tv(p_t, π) ≤ ε},

and it is easy to show that the maximum occurs when X_0 = ω_0, with probability one, for some state ω_0. This is because D_tv(p_t, π) is a convex function of p_0, and so the maximum of D_tv(p_t, π) occurs at an extreme point of the set of probabilities p_0. [I think this should be moved to the next chapter.]

We now provide a simple lemma which indicates that the variation distance D_tv(p_t, π) goes to zero exponentially. We define several related quantities: p_t^{(i)} denotes the t-step distribution, conditional on X_0 = i, and

d_i(t) = D_tv(p_t^{(i)}, π),   d(t) = max_i d_i(t),   d̄(t) = max_{i,j} D_tv(p_t^{(i)}, p_t^{(j)}).

Lemma For all s, t ≥ 0,

(a) d̄(s + t) ≤ d̄(s) d̄(t).
(b) d(s + t) ≤ 2 d(s) d(t).
(c) d̄(s) ≤ 2 d(s).
(d) d(s) ≤ d(t) for s ≥ t.

Proof We will use the characterisation of variation distance as

D_tv(µ_1, µ_2) = min Pr(X_1 ≠ X_2),   (1.17)

where the minimum is taken over pairs of random variables X_1, X_2 such that X_i has distribution µ_i, i = 1, 2. Fix states i_1, i_2 and times s, t, and let Y^1, Y^2 denote the chains started at i_1, i_2 respectively. By (1.17) we can construct a joint distribution for (Y^1_s, Y^2_s) such that

Pr(Y^1_s ≠ Y^2_s) = D_tv(p_s^{(i_1)}, p_s^{(i_2)}) ≤ d̄(s).

Now for each pair j_1, j_2 we can use (1.17) to construct a joint distribution for (Y^1_{s+t}, Y^2_{s+t}) such that

Pr(Y^1_{s+t} ≠ Y^2_{s+t} | Y^1_s = j_1, Y^2_s = j_2) = D_tv(p_t^{(j_1)}, p_t^{(j_2)}).

The RHS is 0 if j_1 = j_2 and otherwise at most d̄(t). So, unconditionally,

Pr(Y^1_{s+t} ≠ Y^2_{s+t}) ≤ d̄(s) d̄(t),

and (1.17) establishes part (a) of the lemma. For part (b), the same argument, with Y^2 now being the stationary chain, shows

d(s + t) ≤ d(s) d̄(t),   (1.18)

and so (b) will follow from (c), which follows from the triangle inequality for variation distance. Finally note that (d) follows from (1.18), since d̄(t) ≤ 1.

We will for the most part use carefully defined Markov chains as our good samplers. As an example, we now define a simple chain with state space Ω equal to the collection of independent sets of a graph G. The chain is ergodic and its steady state is uniform over Ω, so running the chain for sufficiently long will produce a near uniformly chosen independent set; see (1.12). Unfortunately, this chain does not have a small enough mixing time to qualify as a good sampler unless Δ(G) ≤ 4. We define the chain as follows: suppose X_t = I. Then we choose a vertex v of G uniformly at random. If v ∈ I then we put X_{t+1} = I \ {v}. If v ∉ I and I ∪ {v} is an independent set, then we put X_{t+1} = I ∪ {v}. Otherwise we let X_{t+1} = X_t = I. Thus, with n = |V| and I, J independent sets of G, the transition matrix can be described as follows:

P(I, J) = 1/n if |I ⊕ J| = 1, and P(I, J) = 0 otherwise (J ≠ I).

Here I ⊕ J denotes the symmetric difference (I \ J) ∪ (J \ I). The chain satisfies M1 and M2: in D_M every state can reach, and is reachable from, the empty set ∅, implying that M1 holds; also, D_M contains loops unless G has no edges, and in both cases M2 holds trivially. Note finally that P(I, J) = P(J, I), and so (1.15) holds with π(I) = 1/|Ω|. Thus the chain is reversible and the steady state is uniform. (One transition of this chain is sketched in code below.)

1.4 A formal computational framework

The sample spaces we have in mind are sets of combinatorial objects. However, in order to discuss the computational complexity of generation, it is necessary to consider a sequence of instances of increasing size. We therefore work within the following formal framework.
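A minimal sketch of one transition of the insert/delete chain on independent sets described above (illustrative Python; the names are assumptions):

```python
import random

def insert_delete_step(I, adj, n):
    """One transition of the insert/delete chain on independent sets.
    I: current independent set (a set of vertices 0..n-1); adj: adjacency
    lists of G.  A blocked insertion leaves the state unchanged."""
    v = random.randrange(n)
    if v in I:
        return I - {v}                      # delete v
    if all(u not in I for u in adj[v]):
        return I | {v}                      # insert v: still independent
    return I                                # blocked move: stay put

# Tiny example: a triangle; its independent sets are {}, {0}, {1}, {2}.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
I = set()
for _ in range(1000):
    I = insert_delete_step(I, adj, 3)
```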

CHAPTER 3

Markov Chain Monte Carlo: Metropolis and Glauber Chains

3.1. Introduction

Given an irreducible transition matrix P, there is a unique stationary distribution π satisfying π = πP, which we constructed in Section 1.5. We now consider the inverse problem: given a probability distribution π on X, can we find a transition matrix P for which π is its stationary distribution? The following example illustrates why this is a natural problem to consider.

A random sample from a finite set X will mean a random uniform selection from X, i.e., one such that each element has the same chance 1/|X| of being chosen.

Fix a set {1, 2, ..., q} of colors. A proper q-coloring of a graph G = (V, E) is an assignment of colors to the vertices V, subject to the constraint that neighboring vertices do not receive the same color. There are (at least) two reasons to look for an efficient method to sample from X, the set of all proper q-colorings. If a random sample can be produced, then the size of X can be estimated (as we discuss in detail in Section ). Also, if it is possible to sample from X, then average characteristics of colorings can be studied via simulation.

For some graphs, e.g. trees, there are simple recursive methods for generating a random proper coloring (see Example 14.12). However, for other graphs it can be challenging to directly construct a random sample. One approach is to use Markov chains to sample: suppose that (X_t) is a chain with state space X and with stationary distribution uniform on X (in Section 3.3, we will construct one such chain). By the Convergence Theorem (Theorem 4.9, whose proof we have not yet given but have often foreshadowed), X_t is approximately uniformly distributed when t is large. This method of sampling from a given probability distribution is called Markov chain Monte Carlo. Suppose π is a probability distribution on X. If a Markov chain (X_t) with stationary distribution π can be constructed, then, for t large enough, the distribution of X_t is close to π. The focus of this book is to determine how large t must be to obtain a sufficiently close approximation. In this chapter we will focus on the task of finding chains with a given stationary distribution.

3.2. Metropolis Chains

Given some chain with state space X and an arbitrary stationary distribution, can the chain be modified so that the new chain has the stationary distribution π? The Metropolis algorithm accomplishes this.

3.2.1. Symmetric base chain. Suppose that Ψ is a symmetric transition matrix. In this case, Ψ is reversible with respect to the uniform distribution on X.

We now show how to modify transitions made according to Ψ to obtain a chain with stationary distribution π, given an arbitrary probability distribution π on X. The new chain evolves as follows: when at state x, a candidate move is generated from the distribution Ψ(x, ·). If the proposed new state is y, then the move is censored with probability 1 − a(x, y). That is, with probability a(x, y), the state y is accepted so that the next state of the chain is y, and with the remaining probability 1 − a(x, y), the chain remains at x. Rejecting moves slows the chain and can reduce its computational efficiency but may be necessary to achieve a specific stationary distribution. We will discuss how to choose the acceptance probability a(x, y) below, but for now observe that the transition matrix P of the new chain is

P(x, y) = Ψ(x, y) a(x, y)   if y ≠ x,
P(x, x) = 1 − Σ_{z : z ≠ x} Ψ(x, z) a(x, z).

By Proposition 1.20, the transition matrix P has stationary distribution π if

π(x)Ψ(x, y)a(x, y) = π(y)Ψ(y, x)a(y, x)   (3.1)

for all x ≠ y. Since we have assumed Ψ is symmetric, equation (3.1) holds if and only if

b(x, y) = b(y, x),   (3.2)

where b(x, y) = π(x)a(x, y). Because a(x, y) is a probability and must satisfy a(x, y) ≤ 1, the function b must obey the constraints

b(x, y) ≤ π(x),   b(x, y) = b(y, x) ≤ π(y).   (3.3)

Since rejecting the moves of the original chain Ψ is wasteful, a solution b to (3.2) and (3.3) should be chosen which is as large as possible. Clearly, all solutions are bounded above by b(x, y) := π(x) ∧ π(y) := min{π(x), π(y)}. For this choice, the acceptance probability a(x, y) is equal to (π(y)/π(x)) ∧ 1. The Metropolis chain for a probability π and a symmetric transition matrix Ψ is defined as

P(x, y) = Ψ(x, y) [1 ∧ (π(y)/π(x))]   if y ≠ x,
P(x, x) = 1 − Σ_{z : z ≠ x} Ψ(x, z) [1 ∧ (π(z)/π(x))].

Our discussion above shows that π is indeed a stationary distribution for the Metropolis chain.

Remark 3.1. A very important feature of the Metropolis chain is that it only depends on the ratios π(x)/π(y). In many cases of interest, π(x) has the form h(x)/Z, where the function h : X → [0, ∞) is known and Z = Σ_{x∈X} h(x) is a normalizing constant. It may be difficult to explicitly compute Z, especially if X is large. Because the Metropolis chain only depends on h(x)/h(y), it is not necessary to compute the constant Z in order to simulate the chain. The optimization chains described below (Example 3.2) are examples of this type.
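In code, one Metropolis update for a symmetric proposal takes the following form. This is a minimal sketch (the names metropolis_step, propose and pi are illustrative, not from the text):

```python
import random

def metropolis_step(x, propose, pi):
    """One step of the Metropolis chain for a symmetric proposal.
    propose(x) draws y from Psi(x, .); pi may be unnormalised, since only
    the ratio pi(y)/pi(x) is used (Remark 3.1)."""
    y = propose(x)
    if random.random() < min(1.0, pi(y) / pi(x)):
        return y          # accept the candidate move
    return x              # censor the move: stay at x
```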

Example 3.2 (Optimization). Let f be a real-valued function defined on the vertex set X of a graph. In many applications it is desirable to find a vertex x where f(x) is maximal. If the domain X is very large, then an exhaustive search may be too expensive.

[Figure 3.1: A hill climb algorithm may become trapped at a local maximum.]

A hill climb is an algorithm which attempts to locate the maximum values of f as follows: when at x, if there is at least one neighbor y of x satisfying f(y) > f(x), move to a neighbor with the largest value of f. The climber may become stranded at local maxima; see Figure 3.1. One solution is to randomize moves so that instead of always remaining at a local maximum, with some probability the climber moves to lower states.

Suppose for simplicity that X is a regular graph, so that simple random walk on X has a symmetric transition matrix. Fix λ ≥ 1 and define

π_λ(x) = λ^{f(x)} / Z(λ),

where Z(λ) := Σ_{x∈X} λ^{f(x)} is the normalizing constant that makes π_λ a probability measure (as mentioned in Remark 3.1, running the Metropolis chain does not require computation of Z(λ), which may be prohibitively expensive to compute). Since π_λ(x) is increasing in f(x), the measure π_λ favors vertices x for which f(x) is large.

If f(y) < f(x), the Metropolis chain accepts a transition x → y with probability λ^{−[f(x) − f(y)]}. As λ → ∞, the chain more closely resembles the deterministic hill climb. Define

X* := {x ∈ X : f(x) = f* := max_{y∈X} f(y)}.

Then

lim_{λ→∞} π_λ(x) = lim_{λ→∞} (λ^{f(x)}/λ^{f*}) / (|X*| + Σ_{x'∈X\X*} λ^{f(x')}/λ^{f*}) = 1{x ∈ X*} / |X*|.

That is, as λ → ∞, the stationary distribution π_λ of this Metropolis chain converges to the uniform distribution over the global maxima of f.
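A minimal sketch of this randomized hill climb, reusing the metropolis_step sketch above (the graph, f and λ here are illustrative choices, not from the text):

```python
import random

# Metropolis chain for pi_lambda(x) ~ lam ** f(x) on the cycle Z_n.
# Simple random walk on the cycle is 2-regular, hence symmetric.
n, lam = 100, 1.5
f = lambda x: -abs(x - 37)                              # unique maximum at 37
propose = lambda x: (x + random.choice([-1, 1])) % n    # symmetric proposal
pi = lambda x: lam ** f(x)                              # unnormalised

x = 0
for _ in range(20000):
    x = metropolis_step(x, propose, pi)
print(x)   # typically near 37: pi_lambda concentrates on the maxima of f
```

A downhill move of size 1 is accepted with probability λ^{−1}, so larger λ makes the walk behave more like the deterministic hill climb.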

3.2.2. General base chain. The Metropolis chain can also be defined when the initial transition matrix is not symmetric. For a general (irreducible) transition matrix Ψ and an arbitrary probability distribution π on X, the Metropolized chain is executed as follows. When at state x, generate a state y from Ψ(x, ·). Move to y with probability

[π(y)Ψ(y, x) / (π(x)Ψ(x, y))] ∧ 1,   (3.4)

and remain at x with the complementary probability. The transition matrix P for this chain is

P(x, y) = Ψ(x, y) [(π(y)Ψ(y, x))/(π(x)Ψ(x, y)) ∧ 1]   if y ≠ x,
P(x, x) = 1 − Σ_{z : z ≠ x} Ψ(x, z) [(π(z)Ψ(z, x))/(π(x)Ψ(x, z)) ∧ 1].   (3.5)

The reader should check that the transition matrix (3.5) defines a reversible Markov chain with stationary distribution π (see Exercise 3.1).

Example 3.3. Suppose you know neither the vertex set V nor the edge set E of a graph G. However, you are able to perform a simple random walk on G. (Many computer and social networks have this form; each vertex knows who its neighbors are, but not the global structure of the graph.) If the graph is not regular, then the stationary distribution is not uniform, so the distribution of the walk will not converge to uniform. You desire a uniform sample from V. We can use the Metropolis algorithm to modify the simple random walk and ensure a uniform stationary distribution. The acceptance probability in (3.4) reduces in this case to

(deg(x)/deg(y)) ∧ 1.

This biases the walk against moving to higher degree vertices, giving a uniform stationary distribution. Note that it is not necessary to know the size of the vertex set to perform this modification, which can be an important consideration in applications.

3.3. Glauber Dynamics

We will study many chains whose state spaces are contained in a set of the form S^V, where V is the vertex set of a graph and S is a finite set. The elements of S^V, called configurations, are the functions from V to S. We visualize a configuration as a labeling of vertices with elements of S.

Given a probability distribution π on a space of configurations, the Glauber dynamics for π, to be defined below, is a Markov chain which has stationary distribution π. This chain is often called the Gibbs sampler, especially in statistical contexts.

3.3.1. Two examples. As we defined in Section 3.1, a proper q-coloring of a graph G = (V, E) is an element x of {1, 2, ..., q}^V, the set of functions from V to {1, 2, ..., q}, such that x(v) ≠ x(w) for all edges {v, w}. We construct here a Markov chain on the set of proper q-colorings of G. For a given configuration x and a vertex v, call a color j allowable at v if j is different from all colors assigned to neighbors of v. That is, a color is allowable at v if it does not belong to the set {x(w) : w ∼ v}. Given a proper q-coloring x, we can generate a new coloring by selecting a vertex v ∈ V at random, selecting a color j uniformly at random from the allowable colors at v, and recoloring v with the color j.
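A minimal sketch of this update rule for proper colorings (illustrative Python; the function name and example graph are assumptions):

```python
import random

def glauber_coloring_step(x, adj, q):
    """One update of the Glauber dynamics for proper q-colorings described
    above: pick a vertex v uniformly at random, then recolor v with a color
    chosen uniformly from the colors allowable at v.  (If q exceeds the
    maximum degree, at least one color is always allowable.)"""
    v = random.randrange(len(x))
    neighbor_colors = {x[w] for w in adj[v]}
    allowable = [j for j in range(1, q + 1) if j not in neighbor_colors]
    x = list(x)
    x[v] = random.choice(allowable)
    return x

# 4-cycle with q = 3 colors, started from a proper coloring.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
x = [1, 2, 1, 2]
for _ in range(1000):
    x = glauber_coloring_step(x, adj, 3)
```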

which is a useful identity.

Remark 4.4. From Proposition 4.2 and the triangle inequality for real numbers, it is easy to see that total variation distance satisfies the triangle inequality: for probability distributions µ, ν and η,

‖µ − ν‖_TV ≤ ‖µ − η‖_TV + ‖η − ν‖_TV.   (4.6)

Proposition 4.5. Let µ and ν be two probability distributions on X. Then the total variation distance between them satisfies

‖µ − ν‖_TV = (1/2) sup{ Σ_{x∈X} f(x)µ(x) − Σ_{x∈X} f(x)ν(x) : f satisfying max_{x∈X} |f(x)| ≤ 1 }.   (4.7)

Proof. If max |f(x)| ≤ 1, then

(1/2) |Σ_{x∈X} f(x)µ(x) − Σ_{x∈X} f(x)ν(x)| ≤ (1/2) Σ_{x∈X} |µ(x) − ν(x)| = ‖µ − ν‖_TV.

Thus, the right-hand side of (4.7) is at most ‖µ − ν‖_TV. For the other direction, define

f*(x) = 1 if µ(x) ≥ ν(x), and f*(x) = −1 if µ(x) < ν(x).

Then

(1/2) [Σ_{x∈X} f*(x)µ(x) − Σ_{x∈X} f*(x)ν(x)] = (1/2) Σ_{x∈X} f*(x)[µ(x) − ν(x)]
= (1/2) [ Σ_{x : µ(x)≥ν(x)} [µ(x) − ν(x)] + Σ_{x : ν(x)>µ(x)} [ν(x) − µ(x)] ].

Using (4.5) shows that the right-hand side above equals ‖µ − ν‖_TV. Hence the right-hand side of (4.7) is at least ‖µ − ν‖_TV.
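Both characterizations are easy to check numerically. A minimal sketch (illustrative, with hand-picked distributions):

```python
import numpy as np

# Total variation distance computed two ways: half the L1 norm, as in
# (4.5), and the supremum (4.7), where the optimizer f* is the sign of
# mu - nu as in the proof above.
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.25, 0.25, 0.5])
tv_l1 = 0.5 * np.abs(mu - nu).sum()
f_star = np.where(mu >= nu, 1.0, -1.0)
tv_sup = 0.5 * (f_star @ mu - f_star @ nu)
print(tv_l1, tv_sup)   # both 0.3
```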

4.2. Coupling and Total Variation Distance

A coupling of two probability distributions µ and ν is a pair of random variables (X, Y) defined on a single probability space such that the marginal distribution of X is µ and the marginal distribution of Y is ν. That is, a coupling (X, Y) satisfies P{X = x} = µ(x) and P{Y = y} = ν(y).

Coupling is a general and powerful technique; it can be applied in many different ways. Indeed, Chapters 5 and 14 use couplings of entire chain trajectories to bound rates of convergence to stationarity. Here, we offer a gentle introduction by showing the close connection between couplings of two random variables and the total variation distance between those variables.

Example 4.6. Let µ and ν both be the fair coin measure giving weight 1/2 to each of the elements of {0, 1}.

(i) One way to couple µ and ν is to define (X, Y) to be a pair of independent coins, so that P{X = x, Y = y} = 1/4 for all x, y ∈ {0, 1}.

(ii) Another way to couple µ and ν is to let X be a fair coin toss and define Y = X. In this case, P{X = Y = 0} = 1/2, P{X = Y = 1} = 1/2, and P{X ≠ Y} = 0.

Given a coupling (X, Y) of µ and ν, if q is the joint distribution of (X, Y) on X × X, meaning that q(x, y) = P{X = x, Y = y}, then q satisfies

Σ_{y∈X} q(x, y) = Σ_{y∈X} P{X = x, Y = y} = P{X = x} = µ(x)

and

Σ_{x∈X} q(x, y) = Σ_{x∈X} P{X = x, Y = y} = P{Y = y} = ν(y).

Conversely, given a probability distribution q on the product space X × X which satisfies

Σ_{y∈X} q(x, y) = µ(x) and Σ_{x∈X} q(x, y) = ν(y),

there is a pair of random variables (X, Y) having q as their joint distribution, and consequently this pair (X, Y) is a coupling of µ and ν. In summary, a coupling can be specified either by a pair of random variables (X, Y) defined on a common probability space or by a distribution q on X × X.

Returning to Example 4.6, the coupling in part (i) could equivalently be specified by the probability distribution q_1 on {0, 1}² given by

q_1(x, y) = 1/4 for all (x, y) ∈ {0, 1}².

Likewise, the coupling in part (ii) can be identified with the probability distribution q_2 given by

q_2(x, y) = 1/2 if (x, y) = (0, 0) or (x, y) = (1, 1), and q_2(x, y) = 0 if (x, y) = (0, 1) or (x, y) = (1, 0).
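A quick numerical look at these two couplings (a minimal sketch, not from the text):

```python
import numpy as np

# The two couplings of a pair of fair coins from Example 4.6, written as
# joint distributions on {0,1}^2.  Both have uniform marginals, but they
# differ in P(X != Y).
q1 = np.full((2, 2), 0.25)                 # independent coupling, part (i)
q2 = np.array([[0.5, 0.0], [0.0, 0.5]])    # identical coupling Y = X, part (ii)
for q in (q1, q2):
    print(q.sum(axis=1), q.sum(axis=0))    # marginals: both fair coins
    print(q[0, 1] + q[1, 0])               # P(X != Y): 0.5, then 0.0
```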

Any two distributions µ and ν have an independent coupling. However, when µ and ν are not identical, it will not be possible for X and Y to always have the same value. How close can a coupling get to having X and Y identical? Total variation distance gives the answer.

Proposition 4.7. Let µ and ν be two probability distributions on X. Then

‖µ − ν‖_TV = inf{ P{X ≠ Y} : (X, Y) is a coupling of µ and ν }.   (4.8)

Remark 4.8. We will in fact show that there is a coupling (X, Y) which attains the infimum in (4.8). We will call such a coupling optimal.

Proof. First, we note that for any coupling (X, Y) of µ and ν and any event A ⊆ X,

µ(A) − ν(A) = P{X ∈ A} − P{Y ∈ A}   (4.9)
≤ P{X ∈ A, Y ∉ A}   (4.10)
≤ P{X ≠ Y}.   (4.11)

(Dropping the event {X ∈ A, Y ∈ A} from the second term of the difference gives the first inequality.) It immediately follows that

‖µ − ν‖_TV ≤ inf{ P{X ≠ Y} : (X, Y) is a coupling of µ and ν }.   (4.12)

[Figure 4.2: Since each of regions I and II has area ‖µ − ν‖_TV and µ and ν are probability measures, region III has area 1 − ‖µ − ν‖_TV.]

It will suffice to construct a coupling for which P{X ≠ Y} is exactly equal to ‖µ − ν‖_TV. We will do so by forcing X and Y to be equal as often as they possibly can be. Consider Figure 4.2. Region III, bounded by µ(x) ∧ ν(x) = min{µ(x), ν(x)}, can be seen as the overlap between the two distributions. Informally, our coupling proceeds by choosing a point in the union of regions I and III, and setting X to be the x-coordinate of this point. If the point is in III, we set Y = X, and if it is in I, then we choose independently a point at random from region II, and set Y to be the x-coordinate of the newly selected point. In the second scenario, X ≠ Y, since the two regions are disjoint.

More formally, we use the following procedure to generate X and Y. Let

p = Σ_{x∈X} µ(x) ∧ ν(x).

Write

Σ_{x∈X} µ(x) ∧ ν(x) = Σ_{x : µ(x)≤ν(x)} µ(x) + Σ_{x : µ(x)>ν(x)} ν(x).

Adding and subtracting Σ_{x : µ(x)>ν(x)} µ(x) to the right-hand side above shows that

Σ_{x∈X} µ(x) ∧ ν(x) = 1 − Σ_{x : µ(x)>ν(x)} [µ(x) − ν(x)].

By equation (4.5) and the immediately preceding equation,

Σ_{x∈X} µ(x) ∧ ν(x) = 1 − ‖µ − ν‖_TV = p.   (4.13)

Flip a coin with probability of heads equal to p.

(i) If the coin comes up heads, then choose a value Z according to the probability distribution

γ_III(x) = (µ(x) ∧ ν(x)) / p,

and set X = Y = Z.

(ii) If the coin comes up tails, choose X according to the probability distribution

γ_I(x) = (µ(x) − ν(x)) / ‖µ − ν‖_TV if µ(x) > ν(x), and 0 otherwise,

and independently choose Y according to the probability distribution

γ_II(x) = (ν(x) − µ(x)) / ‖µ − ν‖_TV if ν(x) > µ(x), and 0 otherwise.

Note that (4.5) ensures that γ_I and γ_II are probability distributions. Clearly,

pγ_III + (1 − p)γ_I = µ,
pγ_III + (1 − p)γ_II = ν,

so that the distribution of X is µ and the distribution of Y is ν. Note that in the case that the coin lands tails up, X ≠ Y, since γ_I and γ_II are positive on disjoint subsets of X. Thus X = Y if and only if the coin toss is heads. We conclude that P{X ≠ Y} = ‖µ − ν‖_TV.

4.3. The Convergence Theorem

We are now ready to prove that irreducible, aperiodic Markov chains converge to their stationary distributions, a key step, as much of the rest of the book will be devoted to estimating the rate at which this convergence occurs. The assumption of aperiodicity is indeed necessary; recall the even n-cycle of Example 1.4.

As is often true of such fundamental facts, there are many proofs of the Convergence Theorem. The one given here decomposes the chain into a mixture of repeated independent sampling from the stationary distribution and another Markov chain. See Exercise 5.1 for another proof using two coupled copies of the chain.

Theorem 4.9 (Convergence Theorem). Suppose that P is irreducible and aperiodic, with stationary distribution π. Then there exist constants α ∈ (0, 1) and C > 0 such that

max_{x∈X} ‖P^t(x, ·) − π‖_TV ≤ Cα^t.   (4.14)

Proof. Since P is irreducible and aperiodic, by Proposition 1.7 there exists an r such that P^r has strictly positive entries. Let Π be the matrix with |X| rows, each of which is the row vector π. For sufficiently small δ > 0, we have

P^r(x, y) ≥ δπ(y) for all x, y ∈ X.

Let θ = 1 − δ. The equation

P^r = (1 − θ)Π + θQ   (4.15)

defines a stochastic matrix Q. It is a straightforward computation to check that MΠ = Π for any stochastic matrix M and that ΠM = Π for any matrix M such that πM = π. Next, we use induction to demonstrate that

P^{rk} = (1 − θ^k)Π + θ^k Q^k.   (4.16)
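As an aside, the optimal coupling constructed in the proof of Proposition 4.7 is easy to sample from directly. A minimal sketch (illustrative Python, using numpy; names are assumptions):

```python
import numpy as np

def optimal_coupling_sample(mu, nu, rng):
    """Draw (X, Y) from the coupling in the proof of Proposition 4.7:
    with probability p = 1 - ||mu - nu||_TV set X = Y, sampled from the
    overlap gamma_III; otherwise sample X from gamma_I and Y from
    gamma_II, which live on disjoint sets, so X != Y."""
    overlap = np.minimum(mu, nu)
    p = overlap.sum()                       # = 1 - TV distance, by (4.13)
    if rng.random() < p:
        x = rng.choice(len(mu), p=overlap / p)
        return x, x
    gamma_I = np.clip(mu - nu, 0, None)
    gamma_II = np.clip(nu - mu, 0, None)
    x = rng.choice(len(mu), p=gamma_I / gamma_I.sum())
    y = rng.choice(len(nu), p=gamma_II / gamma_II.sum())
    return x, y

rng = np.random.default_rng(0)
mu, nu = np.array([0.5, 0.3, 0.2]), np.array([0.25, 0.25, 0.5])
draws = [optimal_coupling_sample(mu, nu, rng) for _ in range(100000)]
print(np.mean([x != y for x, y in draws]))   # ~0.3 = ||mu - nu||_TV
```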

i.e., we change the i-th component from x_i to y_i. Note that some of the edges may be loops (if x_i = y_i). To compute ρ, fix attention on a particular (oriented) edge

t = (w, w') = ((w_0, ..., w_i, ..., w_{n−1}), (w_0, ..., w'_i, ..., w_{n−1})),

and consider the number of canonical paths γ_xy that include t. The number of possible choices for x is 2^i, as the final n − i positions are determined by x_j = w_j for j ≥ i; and by a similar argument the number of possible choices for y is 2^{n−i−1}. Thus the total number of canonical paths using a particular edge t is 2^{n−1}; furthermore, Q(w, w') = π(w)P(w, w') = 2^{−n}(2n)^{−1}, and the length of every canonical path is exactly n. Plugging all these bounds into the definition of ρ yields ρ ≤ n². Thus, by Theorem 2.2.4, the mixing time of W_n is

τ(ε) ≤ n²(n ln 2 + ln ε^{−1}).

2.2.1 Comparison Theorems

2.2.2 Decomposition Theorem

2.3 Coupling

A coupling C(M) for M is a stochastic process (X_t, Y_t) on Ω × Ω such that each of X_t, Y_t is marginally a copy of M,

Pr(X_t = σ_1 | X_{t−1} = ω_1) = P(ω_1, σ_1),
Pr(Y_t = σ_2 | Y_{t−1} = ω_2) = P(ω_2, σ_2)   (∀t > 0).   (2.18)

The following simple but powerful inequality then follows easily from these definitions.

Lemma (Coupling Lemma) Let X_t, Y_t be a coupling for M such that Y_0 has the stationary distribution π. Then, if X_t has distribution p_t,

D_tv(p_t, π) ≤ Pr(X_t ≠ Y_t).   (2.19)

Proof Suppose A_t ⊆ Ω maximizes in (1.3). Then, since Y_t has distribution π,

D_tv(p_t, π) = Pr(X_t ∈ A_t) − Pr(Y_t ∈ A_t) ≤ Pr(X_t ∈ A_t, Y_t ∉ A_t) ≤ Pr(X_t ≠ Y_t).

It is important to remember that the Markov chain Y_t is simply a proof construct, and X_t the chain we actually observe. We also require that X_t = Y_t implies X_{t+1} = Y_{t+1},

since this makes the right side of (2.19) nonincreasing. Then the earliest epoch T at which X_T = Y_T is called coalescence, making T a random variable. A successful coupling is one for which lim_{t→∞} Pr(X_t ≠ Y_t) = 0. Clearly we are only interested in successful couplings.

As an example consider our random walk on the cube Q_n. We can define a coupling as follows: given (X_t, Y_t) we

(a) Choose i uniformly at random from [n].

(b) Put X_{t+1,j} = X_{t,j} and Y_{t+1,j} = Y_{t,j} for j ≠ i.

(c) If X_{t,i} = Y_{t,i} then

X_{t+1,i} = Y_{t+1,i} = X_{t,i} with probability 1/2, and = 1 − X_{t,i} with probability 1/2;

(d) otherwise

(X_{t+1,i}, Y_{t+1,i}) = (X_{t,i}, 1 − Y_{t,i}) with probability 1/2, and = (1 − X_{t,i}, Y_{t,i}) with probability 1/2.

It should hopefully be clear that this is a coupling, i.e. the marginals are correct and X_t = Y_t implies X_{t+1} = Y_{t+1}. Now let I_t = {i : i is chosen in (a) during steps 1, 2, ..., t}. Then I_t = [n] implies that X_τ = Y_τ for τ ≥ t. So

Pr(X_t ≠ Y_t) ≤ Pr(I_t ≠ [n]) = Pr(Ī_t ≠ ∅) ≤ E(|Ī_t|) = n(1 − 1/n)^t.

So if t = n(log n + log ε^{−1}) we have d_TV(p_t, π) ≤ ε.
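This coupling is easy to simulate. The following minimal sketch (illustrative Python, not from the notes) estimates the coalescence time T, which behaves like coupon collecting:

```python
import random

def coupled_cube_step(x, y, n):
    """One coupled step of the lazy walk on Q_n following (a)-(d) above:
    once coordinate i has been chosen, x[i] and y[i] agree forever."""
    i = random.randrange(n)
    if x[i] == y[i]:
        x[i] = y[i] = random.randrange(2)   # same fair coin for both chains
    elif random.random() < 0.5:
        y[i] = x[i]    # case (X_{t,i}, 1 - Y_{t,i}): coalesce to x[i]
    else:
        x[i] = y[i]    # case (1 - X_{t,i}, Y_{t,i}): coalesce to y[i]

# Coalescence time is roughly n log n, here with n = 50.
n = 50
x, y = [0] * n, [1] * n
t = 0
while x != y:
    coupled_cube_step(x, y, n)
    t += 1
print(t)
```

In both branches of the unequal case each marginal flips coordinate i with probability 1/2, as the lazy walk requires, yet coordinate i always coalesces.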

A coupling is a Markovian coupling if the process C(M) is a Markov chain on Ω × Ω. There always exists a maximal coupling, which gives equality in (2.19). This maximal coupling is in general non-Markovian, and is seemingly not constructible without knowing p_t (t = 1, 2, ...). But coupling has little algorithmic value if we already know p_t. More generally, it seems difficult to prove mixing properties of non-Markovian couplings in our setting. Therefore we restrict attention to Markovian couplings, at the (probable) cost of sacrificing equality in (2.19).

Let C(M) be a Markovian coupling, with Q its transition matrix, i.e. the probability of a joint transition from (ω_1, ω_2) to (σ_1, σ_2) is Q^{ω_1 ω_2}_{σ_1 σ_2}. The precise conditions required of Q are then

Q^{ωω}_{σ_1 σ_2} ≠ 0 implies σ_1 = σ_2   (∀ω ∈ Ω),   (2.20)

Σ_{σ_2∈Ω} Q^{ω_1 ω_2}_{σ_1 σ_2} = P_{ω_1 σ_1}   (∀ω_2 ∈ Ω),   Σ_{σ_1∈Ω} Q^{ω_1 ω_2}_{σ_1 σ_2} = P_{ω_2 σ_2}   (∀ω_1 ∈ Ω).   (2.21)

Here (2.20) implies equality after coalescence, and (2.21) implies the marginals are copies of M. Our goal is to design Q so that Pr(X_t ≠ Y_t) quickly becomes small. We need only specify Q to satisfy (2.21) for ω_1 ≠ ω_2. The other entries are completely determined by (2.20) and (2.21).

In general, to prove rapid mixing using coupling, it is usual to map C(M) to a process on N by defining a function ψ : Ω × Ω → N such that ψ(ω_1, ω_2) = 0 implies ω_1 = ω_2. We call this a proximity function. Then Pr(X_t ≠ Y_t) ≤ E(ψ(X_t, Y_t)) by Markov's inequality, and we need only show that E(ψ(X_t, Y_t)) converges quickly to zero.

2.4 Path coupling

A major difficulty with coupling is that we are obliged to specify it, and show improvement in the proximity function, for every pair of states. The idea of path coupling, where applicable, can be a major saving in this respect. We describe the approach below.

As a simple example of this approach consider a Markov chain where Ω ⊆ S^m for some set S and positive integer m. Suppose also that if ω, σ ∈ Ω and h(ω, σ) = d (Hamming distance), then there exists a sequence ω = x_0, x_1, ..., x_d = σ such that (i) {x_0, x_1, ..., x_d} ⊆ Ω, (ii) h(x_i, x_{i+1}) = 1 for i = 0, 1, ..., d − 1, and (iii) P(x_i, x_{i+1}) > 0.

Now suppose we define a coupling of the chains (X_t, Y_t) only for the case where h(X_t, Y_t) = 1. Suppose then that

E(h(X_{t+1}, Y_{t+1}) | h(X_t, Y_t) = 1) ≤ β   (2.22)

for some β < 1. Then

E(h(X_{t+1}, Y_{t+1})) ≤ β h(X_t, Y_t)   (2.23)

in all cases. It then follows that

d_TV(p_t, π) ≤ Pr(X_t ≠ Y_t) ≤ mβ^t.

Equation (2.23) is shown by choosing a sequence X_t = Z_0, Z_1, ..., Z_d = Y_t, d = h(X_t, Y_t), where Z_0, Z_1, ..., Z_d satisfy (i), (ii), (iii) above. Then we can couple Z_i and Z_{i+1}, 0 ≤ i < d, so that X_{t+1} = Z'_0, Z'_1, ..., Z'_d = Y_{t+1} and (i) Pr(Z'_i = σ | Z_i = ω) = P(ω, σ) and (ii)

E(h(Z'_i, Z'_{i+1})) ≤ β. Therefore

E(h(X_{t+1}, Y_{t+1})) ≤ Σ_{i=1}^{d} E(h(Z'_{i−1}, Z'_i)) ≤ βd,

and (2.23) follows.

As an example, let G = (V, E) be a graph with maximum degree Δ and let k be an integer. Let Ω_k be the set of proper k-vertex-colourings of G, i.e. {c : V → [k]} such that (v, w) ∈ E implies c(v) ≠ c(w). We describe a chain which provides a good sampler for the uniform distribution over Ω_k. We let Ω = [k]^V be all k-colourings, including improper ones, and describe a chain on Ω for which only proper colourings have a positive steady state probability. To describe a general step of the chain assume X_t ∈ Ω. Then

Step 1 Choose w uniformly from V and x uniformly from [k].

Step 2 Let X_{t+1}(v) = X_t(v) for v ∈ V \ {w}.

Step 3 If no neighbour of w in G has colour x then put X_{t+1}(w) = x, otherwise put X_{t+1}(w) = X_t(w).

Note that P(ω, σ) = P(σ, ω) = 1/(nk) for two proper colourings which can be obtained from each other by a single move of the chain. It follows from (1.15) that the steady state is uniform over Ω_k. (One transition of this chain is sketched in code below.)

We first describe a coupling which is extremely simple but needs k > 3Δ in order for (2.22) to be satisfied. Let h(X_t, Y_t) = 1 and let v_0 be the unique vertex of V such that X_t(v_0) ≠ Y_t(v_0). In our coupling we choose w, x as in Step 1 and try to colour w with x in both chains. We claim that

E(h(X_{t+1}, Y_{t+1})) ≤ 1 − (1/n)(1 − Δ/k) + 2Δ/(nk) = 1 − (k − 3Δ)/(kn),   (2.24)

and so we can take β ≤ 1 − 1/(kn) in (2.23) if k > 3Δ.

The term (1/n)(1 − Δ/k) in (2.24) lower bounds the probability that w = v_0 and that x is not used in the neighbourhood of v_0, in which case we will have X_{t+1} = Y_{t+1}. Next let c_X ≠ c_Y be the colours of v_0 in X_t, Y_t respectively. The term 2Δ/(nk) in (2.24) is an upper bound for the probability that w is in the neighbourhood of v_0 and x ∈ {c_X, c_Y}, in which case we might have h(X_{t+1}, Y_{t+1}) = 2. In all other cases we find that h(X_{t+1}, Y_{t+1}) = h(X_t, Y_t) = 1.
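A minimal sketch of one transition of the colouring chain defined in Steps 1-3 above (illustrative Python; unlike the Glauber dynamics, a blocked recolouring is simply rejected):

```python
import random

def colouring_step(c, adj, k):
    """One step of the chain above: choose a vertex w and a colour x
    uniformly; recolour w with x unless some neighbour of w has colour x,
    in which case the state is unchanged."""
    w = random.randrange(len(c))
    x = random.randrange(1, k + 1)
    if all(c[u] != x for u in adj[w]):
        c = list(c)
        c[w] = x
    return c

# Path on 4 vertices (Delta = 2) with k = 5 > 2*Delta colours,
# started at a proper colouring.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
c = [1, 2, 1, 2]
for _ in range(1000):
    c = colouring_step(c, adj, 5)
```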

A better coupling gives the desired result. We proceed as above except for the case where w is a neighbour of v_0 and x ∈ {c_X, c_Y}. In this case, with probability 1/2 we try to colour w with c_X in X_t and colour w with c_Y in Y_t, and fail in both cases. With probability 1/2 we try to colour w with c_Y in X_t and colour w with c_X in Y_t, in which case the Hamming distance may increase by one. Thus for this coupling we have

E(h(X_{t+1}, Y_{t+1})) ≤ 1 − (1/n)(1 − Δ/k) + Δ/(nk) = 1 − (k − 2Δ)/(kn),

and we can take β ≤ 1 − 1/(kn) in (2.23) if k > 2Δ.

We now give a more general framework for the definition of path coupling. Recall that a quasi-metric satisfies the conditions for a metric except possibly the symmetry condition. Any metric is a quasi-metric, but a simple example of a quasi-metric which is not a metric is directed edge distance in a digraph. Suppose we have a relation S ⊆ Ω × Ω such that S has transitive closure Ω × Ω, and suppose that we have a proximity function defined for all pairs in S, i.e. ψ : S → N. Then we may lift ψ to a quasi-metric φ(ω, ω') on Ω as follows. For each pair (ω, ω') ∈ Ω × Ω, consider the set P(ω, ω') of all sequences

ω = ω_1, ω_2, ..., ω_{r−1}, ω_r = ω' with (ω_i, ω_{i+1}) ∈ S (i = 1, ..., r − 1).   (2.25)

Then we set

φ(ω, ω') = min_{P(ω,ω')} Σ_{i=1}^{r−1} ψ(ω_i, ω_{i+1}).   (2.26)

It is easy to prove that φ is a quasi-metric. We call a sequence minimizing (2.26) geodesic. We now show that, without any real loss, we may define the (Markovian) coupling only on pairs in S. Such a coupling is called a path coupling. We give a detailed development below. Clearly S = Ω × Ω is always a relation whose transitive closure is Ω × Ω, but path coupling is only useful when we can define a suitable S which is much smaller than Ω × Ω. A relation of particular interest is R_σ from Section 1.4, but this is not always the best choice.

As in Section 2.3, we use σ (or σ_i) to denote a state obtained by performing a single transition of the chain from the state ω (or ω_i). Let P^ω_σ denote the probability of a transition from state ω to state σ in the Markov chain, and let Q^{ωω'}_{σσ'} denote the probability of a joint transition from (ω, ω') to (σ, σ'), where (ω, ω') ∈ S, as specified by the path coupling. Since this coupling has the correct marginals, we have

Σ_{σ'∈Ω} Q^{ωω'}_{σσ'} = P^ω_σ,   Σ_{σ∈Ω} Q^{ωω'}_{σσ'} = P^{ω'}_{σ'}   (∀(ω, ω') ∈ S).   (2.27)

We extend this to all pairs (ω, ω') ∈ Ω × Ω as follows. For each pair, fix a sequence (ω_1, ω_2, ..., ω_r) ∈ P(ω, ω'). We do not assume the sequence is geodesic here, or indeed

the existence of any proximity function, but this is our eventual purpose. The implied global coupling Q^{ω_1 ω_r}_{σ_1 σ_r} is then defined along this sequence by successively conditioning on the previous choice. Using (2.27), this can be written explicitly as

Q^{ω_1 ω_r}_{σ_1 σ_r} = Σ_{σ_2∈Ω} Σ_{σ_3∈Ω} ··· Σ_{σ_{r−1}∈Ω} Q^{ω_1 ω_2}_{σ_1 σ_2} (Q^{ω_2 ω_3}_{σ_2 σ_3} / P^{ω_2}_{σ_2}) ··· (Q^{ω_{r−1} ω_r}_{σ_{r−1} σ_r} / P^{ω_{r−1}}_{σ_{r−1}}).   (2.28)

Summing (2.28) over σ_r or σ_1, and again applying (2.27), causes the right side to successively simplify, giving

Σ_{σ_r∈Ω} Q^{ω_1 ω_r}_{σ_1 σ_r} = P^{ω_1}_{σ_1}   (∀ω_r ∈ Ω),   Σ_{σ_1∈Ω} Q^{ω_1 ω_r}_{σ_1 σ_r} = P^{ω_r}_{σ_r}   (∀ω_1 ∈ Ω).   (2.29)

Hence the global coupling satisfies (2.21), as we would anticipate from the properties of conditional probabilities. Now suppose the global coupling is determined by geodesic sequences. We bound the expected value of φ(σ_1, σ_r). This is

E(φ(σ_1, σ_r)) = Σ_{σ_1} ··· Σ_{σ_r} φ(σ_1, σ_r) Q^{ω_1 ω_2}_{σ_1 σ_2} (Q^{ω_2 ω_3}_{σ_2 σ_3}/P^{ω_2}_{σ_2}) ··· (Q^{ω_{r−1} ω_r}_{σ_{r−1} σ_r}/P^{ω_{r−1}}_{σ_{r−1}})
≤ Σ_{σ_1} ··· Σ_{σ_r} (Σ_{i=1}^{r−1} φ(σ_i, σ_{i+1})) Q^{ω_1 ω_2}_{σ_1 σ_2} (Q^{ω_2 ω_3}_{σ_2 σ_3}/P^{ω_2}_{σ_2}) ··· (Q^{ω_{r−1} ω_r}_{σ_{r−1} σ_r}/P^{ω_{r−1}}_{σ_{r−1}})
= Σ_{i=1}^{r−1} Σ_{σ_i} Σ_{σ_{i+1}} φ(σ_i, σ_{i+1}) Q^{ω_i ω_{i+1}}_{σ_i σ_{i+1}},   (2.30)

where we have used the triangle inequality for a quasi-metric and the same observation as that leading from (2.28) to (2.29). Suppose we can find β ≤ 1 such that, for all (ω, ω') ∈ S,

E(φ(σ, σ')) = Σ_σ Σ_{σ'} φ(σ, σ') Q^{ωω'}_{σσ'} ≤ β φ(ω, ω').   (2.31)

Then, from (2.30), (2.31) and (2.26) we have

E(φ(σ_1, σ_r)) ≤ Σ_{i=1}^{r−1} β φ(ω_i, ω_{i+1}) = β Σ_{i=1}^{r−1} φ(ω_i, ω_{i+1}) = β φ(ω_1, ω_r).   (2.32)

Thus we can show (2.31) for every pair, merely by showing that this holds for all pairs in S. To apply path coupling to a particular problem, we must find a relation S and

proximity function ψ so that this is possible. In particular we need φ(ω, ω') for (ω, ω') ∈ S to be easily deducible from ψ.

Suppose that Ω has diameter D, i.e. φ(ω, ω') ≤ D for all ω, ω' ∈ Ω. Then Pr(X_t ≠ Y_t) ≤ β^t D, and so if β < 1 we have, since log β^{−1} ≥ 1 − β,

D_tv(p_t, π) ≤ ε for t ≥ log(Dε^{−1})/(1 − β).   (2.33)

This bound is polynomial even when D is exponential in the problem size. It is also possible to prove a bound when β = 1, provided we know the quasi-metric cannot get stuck. Specifically, we need an α > 0 (inversely polynomial in the problem size) such that, in the above notation,

Pr(φ(σ, σ') ≠ φ(ω, ω')) ≥ α   (∀ω, ω' ∈ Ω).   (2.34)

Observe that it is not sufficient simply to establish (2.34) for pairs in S. However, the structure of the path coupling can usually help in proving it. In this case, we can show that

D_tv(p_t, π) ≤ ε for t ≥ eD²/α · ln(ε^{−1}).   (2.35)

This is most easily shown using a martingale argument. Here we need D to be polynomial in the problem size. Consider a sequence (ω_0, ω'_0), (ω_1, ω'_1), ..., (ω_t, ω'_t), and define the random time

T_{ω,ω'} = min{t : φ(ω_t, ω'_t) = 0},

assuming that ω_0 = ω, ω'_0 = ω'. We prove that

E(T_{ω,ω'}) ≤ D²/α.   (2.36)

Let

Z(t) = φ(ω_t, ω'_t)² − 2Dφ(ω_t, ω'_t) − αt

and let

δ(t) = φ(ω_{t+1}, ω'_{t+1}) − φ(ω_t, ω'_t).

Then

E(Z(t + 1) − Z(t) | Z(0), Z(1), ..., Z(t)) = 2(φ(ω_t, ω'_t) − D) E(δ(t) | ω_t, ω'_t) + (E(δ(t)² | ω_t, ω'_t) − α) ≥ 0.

Hence Z(t) is a submartingale. The stopping time T_{ω,ω'} has finite expectation and the increments |Z(t + 1) − Z(t)| are bounded. We can therefore apply the Optional Stopping Theorem for submartingales to obtain

E(Z(T_{ω,ω'})) ≥ Z(0).

This implies

αE(T_{ω,ω'}) ≤ 2Dφ(ω, ω') − φ(ω, ω')² ≤ D²,

and (2.36) follows.

So for any ω, ω', Markov's inequality gives

Pr(T_{ω,ω'} ≥ eD²/α) ≤ e^{−1},

and by considering k consecutive time intervals of length eD²/α we obtain

Pr(T_{ω,ω'} ≥ keD²/α) ≤ e^{−k},

and (2.35) follows.

2.5 Hitting Time Lemmas

For a finite Markov chain M let Pr_i, E_i denote probability and expectation, given that X_0 = i. For a set A ⊆ Ω let

T_A = min{t ≥ 0 : X_t ∈ A}.

Then for i ≠ j the hitting time

H_{i,j} = E_i(T_j)

is the expected number of steps needed to get from state i to state j. The commute time is C_{i,j} = H_{i,j} + H_{j,i}.

Lemma Assume X_0 = i and S is a stopping time with X_S = i. Let j be an arbitrary state. Then

E_i(number of visits to state j before time S) = π_j E_i(S).

Proof Consider the renewal process whose inter-renewal time is distributed as S. The renewal-reward theorem states that the asymptotic proportion of time spent in state j is given by E_i(number of visits to j before time S)/E_i(S). This is also equal to π_j, by the ergodic theorem.

Lemma

E_j(number of visits to j before T_i) = π_j C_{i,j}.

Proof Let S be the time of the first return to i after the first visit to j. Apply the preceding lemma.

The cover time C(M) of M is max_i C_i(M), where C_i(M) = E_i(max_j T_j) is the expected time to visit all states starting at i.
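The first lemma is easy to test by simulation. A minimal Monte Carlo sketch (illustrative, on a hand-picked chain):

```python
import random

# Check of the first lemma on the lazy random walk on a triangle, where
# pi_j = 1/3 for every j.  Starting at i = 0, with S the first return
# time to 0, the expected number of visits to j = 1 before S should be
# pi_1 * E_0(S) = (1/3) * 3 = 1, since E_0(S) = 1/pi_0 by (1.14).
def step(v):
    if random.random() < 0.5:
        return v                           # lazy: stay put
    return (v + random.choice([1, 2])) % 3

visits, total_S, trials = 0, 0, 100000
for _ in range(trials):
    v, t = step(0), 1
    while v != 0:                          # run until the first return to 0
        visits += (v == 1)
        v, t = step(v), t + 1
    total_S += t
print(visits / trials, total_S / trials / 3)   # both close to 1
```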

CHAPTER 2: Bounding the Mixing Time

2.1 Spectral Gap

Let P be the transition matrix of an ergodic, reversible Markov chain on the state space Ω, let π be its stationary distribution, let N = |Ω|, and assume w.l.o.g. that Ω = {0, 1, ..., N − 1}. Let the eigenvalues of P be 1 = λ_0 > λ_1 ≥ ··· ≥ λ_{N−1} > −1; they are all real valued. Let λ_max = max{λ_1, |λ_{N−1}|}. The fact that λ_max < 1 is a classical result of the theory of non-negative matrices. The spectral gap 1 − λ_max determines the mixing rate of the chain in an essential way: the larger the gap, the more rapidly the chain mixes.

Theorem 2.1.1 For all ω, σ ∈ Ω and t ≥ 0,

|P^t(ω, σ) − π(σ)| ≤ (π(σ)/π(ω))^{1/2} λ_max^t.

Proof Let D^{1/2} be the diagonal Ω × Ω matrix with diagonal entries π(ω)^{1/2}, and let D^{−1/2} be its inverse. Then the reversibility (1.15) of the chain implies that the matrix S = D^{1/2} P D^{−1/2} is symmetric. It has the same eigenvalues as P, and the symmetry means that these are all real. We can therefore choose an orthonormal basis of column eigenvectors e^{(i)} ∈ R^Ω, where e^{(i)} has eigenvalue λ_i and e^{(0)}_ω = π(ω)^{1/2}. S has the spectral decomposition

S = Σ_{i=0}^{N−1} λ_i e^{(i)} (e^{(i)})^T.

It follows that S^t = Σ_{i=0}^{N−1} λ_i^t e^{(i)} (e^{(i)})^T for any t ≥ 0, and hence, since P^t = D^{−1/2} S^t D^{1/2}, in component form

P^t(ω, σ) = π(σ) + (π(σ)/π(ω))^{1/2} Σ_{i=1}^{N−1} λ_i^t e^{(i)}_ω e^{(i)}_σ.

With the help of the Cauchy–Schwarz inequality,

|Σ_{i=1}^{N−1} λ_i^t e^{(i)}_ω e^{(i)}_σ| ≤ λ_max^t (Σ_i (e^{(i)}_ω)²)^{1/2} (Σ_i (e^{(i)}_σ)²)^{1/2} = λ_max^t,

since the matrix of eigenvectors is orthogonal. The theorem follows by substitution.

In terms of mixing time we have

Corollary 2.1.1

τ(ε) ≤ (1 − λ_max)^{−1} (ln π_min^{−1} + ln ε^{−1}),

where π_min = min_{ω∈Ω} π(ω).

Proof For ω ∈ Ω we have, by Theorem 2.1.1,

D_tv(p_t, π) ≤ λ_max^t / π_min ≤ e^{−(1−λ_max)t} / π_min,

which is at most ε for t ≥ (1 − λ_max)^{−1}(ln π_min^{−1} + ln ε^{−1}).

As an example we consider the random walk W_n on the unit hypercube. Here the graph is the n-cube Q_n = (V_n, E_n), where V_n = {0, 1}^n and x, y are adjacent in Q_n iff their Hamming distance is one. We add n loops to each vertex to make the chain lazy. If G is a d-regular graph without loops and A is its adjacency matrix, then the probability transition matrix of a random walk on G is P = d^{−1}A. For graphs G_i = (V_i, E_i), i = 1, 2, we can define their product G_1 × G_2 = (V, E), where V = V_1 × V_2 and ((v_1, v_2), (w_1, w_2)) ∈ E iff either v_1 = w_1 and (v_2, w_2) ∈ E_2, or v_2 = w_2 and (v_1, w_1) ∈ E_1. Then

Q_n = K_2 × K_2 × ··· × K_2 (n-fold product).

Theorem 2.1.2 If λ_1, ..., λ_m and µ_1, ..., µ_n are the eigenvalues of the adjacency matrices of G_1 and G_2 respectively, then the eigenvalues of the adjacency matrix of G_1 × G_2 are λ_i + µ_j, 1 ≤ i ≤ m, 1 ≤ j ≤ n.

Proof The adjacency matrix of G_1 × G_2 is A_1 ⊗ I + I ⊗ A_2, where ⊗ denotes the tensor product. The claim follows from the fact that if an mn × mn matrix is decomposed into m² blocks of n × n matrices which commute among themselves, then its determinant can be computed by treating the blocks as scalars and then taking the determinant of the resulting m × m matrix of blocks.

The eigenvalues of K_2 are ±1, so applying Theorem 2.1.2 repeatedly we see that the eigenvalues of Q_n are n − 2i, i = 0, 1, ..., n (ignoring multiplicities). To get the eigenvalues of our random walk we (a) divide by n, and then (b) replace each eigenvalue λ by (1 + λ)/2 to account for the added loops. Thus the eigenvalues of the lazy walk are 1 − i/n, i = 0, 1, ..., n, and in particular the second eigenvalue is 1 − 1/n. Applying Corollary 2.1.1 we obtain

τ(ε) ≤ n(n ln 2 + ln ε^{−1}) = O(n²)

for fixed ε. This poor estimate is due to our use of the Cauchy–Schwarz inequality in the proof of Theorem 2.1.1; coupling gives the better estimate n(log n + log ε^{−1}).

2.6 Conductance

The conductance of M is defined by

Φ = Φ(M) = min{Φ_S : S ⊆ Ω, 0 < π(S) ≤ 1/2},  where  Φ_S = Q(S, S̄)/π(S),

Q(ω, σ) = π(ω)P(ω, σ) and Q(S, S̄) = Σ_{ω∈S, σ∈S̄} Q(ω, σ). Thus Φ_S is the probability of moving from S to S̄ in one step of the chain, conditional on being in S. Clearly Φ_S ≤ 1, and Φ_S ≤ 1/2 if M is lazy. Note that

π(S)Φ_S = Q(S, S̄) = Q(S̄, S) = π(S̄)Φ_{S̄}.

Indeed,

Q(S, S̄) = Q(Ω, S̄) − Q(S̄, S̄) = π(S̄) − Q(S̄, S̄) = Q(S̄, Ω) − Q(S̄, S̄) = Q(S̄, S).

Let π_min = min{π(ω) : ω ∈ Ω} > 0 and π_max = max{π(ω) : ω ∈ Ω}.

2.6.1 Reversible chains

In this section we show how conductance gives us an estimate of the spectral gap of reversible chains.

Lemma If M is lazy and ergodic then all its eigenvalues are positive.

Proof Q' = 2P − I ≥ 0 is stochastic, and P = (I + Q')/2. If Q' has eigenvalues µ_0, µ_1, ..., µ_{N−1} ∈ [−1, 1], then the eigenvalues of P are λ_i = (1 + µ_i)/2 ≥ 0, and the result follows.

For y ∈ R^N let

E(y, y) = Σ_{i<j} Q(i, j)(y_i − y_j)².

Lemma If M is reversible then

1 − λ_1 = min{ E(y, y) / Σ_i π_i y_i² : π^T y = 0, y ≠ 0 }.

Proof Let S = D^{1/2} P D^{−1/2} be as in Section 2.1. Then by the Rayleigh principle,

λ_1 = max{ x^T S x / x^T x : x^T e^{(0)} = 0, x ≠ 0 },

and thus

1 − λ_1 = min{ x^T (I − S) x / x^T x : x^T e^{(0)} = 0, x ≠ 0 } = min{ y^T D(I − P) y / y^T D y : π^T y = 0, y ≠ 0 },

where we have substituted x = D^{1/2} y. Now

y^T D(I − P) y = Σ_i π_i y_i² − Σ_{i,j} π_i P(i, j) y_i y_j = Σ_{i<j} Q(i, j)(y_i − y_j)² = E(y, y),

and y^T D y = Σ_i π_i y_i², from which the lemma follows.

Theorem If M is reversible then λ_1 ≤ 1 − Φ²/2.

Proof (sketch) Take a y attaining the minimum in the previous lemma and order the states so that y_1 ≥ y_2 ≥ ··· ≥ y_N. An application of the Cauchy–Schwarz inequality to E(y, y) Σ_i π_i y_i², followed by summation over the level sets S_r = {1, ..., r}, each of which has conductance at least Φ, yields E(y, y)/Σ_i π_i y_i² ≥ Φ²/2.

Corollary If M is lazy, ergodic and reversible then

τ(ε) ≤ 2Φ^{−2}(ln π_min^{−1} + ln ε^{−1}).

Proof The first lemma of this section implies that λ_max = λ_1, and then the theorem above gives 1/(1 − λ_max) ≤ 2/Φ². Now apply Corollary 2.1.1.

Now consider the conductance of a random walk on a graph G = (V, E). For S, T ⊆ V let E(S, T) = {(v, w) ∈ E : v ∈ S, w ∈ T} and e(S, T) = |E(S, T)|. Then by definition

Φ_S = e(S, S̄) / Σ_{v∈S} d_v.

In particular, when G is an r-regular graph,

Φ = (1/r) min{ e(S, S̄)/|S| : S ⊆ V, 0 < |S| ≤ |V|/2 }.

The minimand above is referred to as the expansion of S. Thus graphs with good expansion (expander graphs) have large conductance, and random walks on them mix rapidly.

As an example consider the n-cube Q_n. For S ⊆ V_n let e_n(S) denote the number of edges of Q_n which are wholly contained in S.

Lemma If S ⊆ V_n then e_n(S) ≤ (1/2)|S| log₂ |S|.

Proof We prove this by induction on n. It is trivial for n = 1. For n > 1 let S_ξ = {x ∈ S : x_n = ξ} for ξ = 0, 1. Then

e_n(S) ≤ e_{n−1}(S_0) + e_{n−1}(S_1) + min{|S_0|, |S_1|},

where the term min{|S_0|, |S_1|} bounds the number of edges which are contained in S and join S_0 to S_1. The lemma now follows from the inequality

x log₂ x + y log₂ y + 2y ≤ (x + y) log₂(x + y)   (x ≥ y ≥ 0),

whose proof is left as a simple exercise in calculus.

By summing the degrees of the vertices of S we see that, by the above lemma, for 0 < |S| ≤ 2^{n−1},

Φ_S = e(S, S̄)/(n|S|) = (n|S| − 2e_n(S))/(n|S|) ≥ 1 − log₂|S|/n ≥ 1/n,

so that Φ ≥ 1/n.
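The eigenvalue computation for the lazy walk on the cube can be checked numerically for small n. A minimal sketch (illustrative Python, using numpy; not part of the notes):

```python
import numpy as np

# Spectral gap of the lazy random walk on the cube Q_n for small n,
# computed directly from the transition matrix.  By the derivation above
# the eigenvalues are 1 - i/n, so the gap 1 - lambda_max should be 1/n.
n = 4
N = 2 ** n
P = np.zeros((N, N))
for x in range(N):
    P[x, x] = 0.5                        # lazy: stay put with probability 1/2
    for i in range(n):
        P[x, x ^ (1 << i)] = 0.5 / n     # otherwise flip a uniform coordinate
eig = np.sort(np.linalg.eigvalsh(P))     # P is symmetric, so eigvalsh applies
lambda_max = max(eig[-2], abs(eig[0]))   # largest non-trivial |eigenvalue|
print(1 - lambda_max)                    # prints 0.25 = 1/n
```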


More information

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505 INTRODUCTION TO MCMC AND PAGERANK Eric Vigoda Georgia Tech Lecture for CS 6505 1 MARKOV CHAIN BASICS 2 ERGODICITY 3 WHAT IS THE STATIONARY DISTRIBUTION? 4 PAGERANK 5 MIXING TIME 6 PREVIEW OF FURTHER TOPICS

More information

MONOTONE COUPLING AND THE ISING MODEL

MONOTONE COUPLING AND THE ISING MODEL MONOTONE COUPLING AND THE ISING MODEL 1. PERFECT MATCHING IN BIPARTITE GRAPHS Definition 1. A bipartite graph is a graph G = (V, E) whose vertex set V can be partitioned into two disjoint set V I, V O

More information

Model Counting for Logical Theories

Model Counting for Logical Theories Model Counting for Logical Theories Wednesday Dmitry Chistikov Rayna Dimitrova Department of Computer Science University of Oxford, UK Max Planck Institute for Software Systems (MPI-SWS) Kaiserslautern

More information

Flip dynamics on canonical cut and project tilings

Flip dynamics on canonical cut and project tilings Flip dynamics on canonical cut and project tilings Thomas Fernique CNRS & Univ. Paris 13 M2 Pavages ENS Lyon November 5, 2015 Outline 1 Random tilings 2 Random sampling 3 Mixing time 4 Slow cooling Outline

More information

Some Definition and Example of Markov Chain

Some Definition and Example of Markov Chain Some Definition and Example of Markov Chain Bowen Dai The Ohio State University April 5 th 2016 Introduction Definition and Notation Simple example of Markov Chain Aim Have some taste of Markov Chain and

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), Institute BW/WI & Institute for Computer Science, University of Hildesheim

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), Institute BW/WI & Institute for Computer Science, University of Hildesheim Course on Information Systems 2, summer term 2010 0/29 Information Systems 2 Information Systems 2 5. Business Process Modelling I: Models Lars Schmidt-Thieme Information Systems and Machine Learning Lab

More information

x 0, x 1,...,x n f(x) p n (x) = f[x 0, x 1,..., x n, x]w n (x),

x 0, x 1,...,x n f(x) p n (x) = f[x 0, x 1,..., x n, x]w n (x), ÛÜØ Þ ÜÒ Ô ÚÜ Ô Ü Ñ Ü Ô Ð Ñ Ü ÜØ º½ ÞÜ Ò f Ø ÚÜ ÚÛÔ Ø Ü Ö ºÞ ÜÒ Ô ÚÜ Ô Ð Ü Ð Þ Õ Ô ÞØÔ ÛÜØ Ü ÚÛÔ Ø Ü Ö L(f) = f(x)dx ÚÜ Ô Ü ÜØ Þ Ü Ô, b] Ö Û Þ Ü Ô Ñ ÒÖØ k Ü f Ñ Df(x) = f (x) ÐÖ D Ü Ü ÜØ Þ Ü Ô Ñ Ü ÜØ Ñ

More information

Chapter 7. Markov chain background. 7.1 Finite state space

Chapter 7. Markov chain background. 7.1 Finite state space Chapter 7 Markov chain background A stochastic process is a family of random variables {X t } indexed by a varaible t which we will think of as time. Time can be discrete or continuous. We will only consider

More information

ÆÓÒ¹ÒØÖÐ ËÒÐØ ÓÙÒÖÝ

ÆÓÒ¹ÒØÖÐ ËÒÐØ ÓÙÒÖÝ ÁÒØÖÐ ÓÙÒÖ Ò Ë»Ì Î ÊÐ ÔÖØÑÒØ Ó ÅØÑØ ÍÒÚÖ ØÝ Ó ÓÖ Á̳½½ ØÝ ÍÒÚÖ ØÝ ÄÓÒÓÒ ÔÖÐ ½ ¾¼½½ ÆÓÒ¹ÒØÖÐ ËÒÐØ ÓÙÒÖÝ ÇÙØÐÒ ËÙÔÖ ØÖÒ Ò Ë»Ì Ì ØÙÔ ÏÓÖÐ Ø Ë¹ÑØÖÜ ÍÒÖÐÝÒ ÝÑÑØÖ ÁÒØÖÐ ÓÙÒÖ ÁÒØÖÐØÝ Ø Ø ÓÙÒÖÝ» ÖÒ Ò ØÛ Ø ÒÒ Ú»Ú

More information

Convex Optimization CMU-10725

Convex Optimization CMU-10725 Convex Optimization CMU-10725 Simulated Annealing Barnabás Póczos & Ryan Tibshirani Andrey Markov Markov Chains 2 Markov Chains Markov chain: Homogen Markov chain: 3 Markov Chains Assume that the state

More information

Monte Carlo Methods. Leon Gu CSD, CMU

Monte Carlo Methods. Leon Gu CSD, CMU Monte Carlo Methods Leon Gu CSD, CMU Approximate Inference EM: y-observed variables; x-hidden variables; θ-parameters; E-step: q(x) = p(x y, θ t 1 ) M-step: θ t = arg max E q(x) [log p(y, x θ)] θ Monte

More information

Faithful couplings of Markov chains: now equals forever

Faithful couplings of Markov chains: now equals forever Faithful couplings of Markov chains: now equals forever by Jeffrey S. Rosenthal* Department of Statistics, University of Toronto, Toronto, Ontario, Canada M5S 1A1 Phone: (416) 978-4594; Internet: jeff@utstat.toronto.edu

More information

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Anthony Trubiano April 11th, 2018 1 Introduction Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability

More information

Introduction to Markov Chains and Riffle Shuffling

Introduction to Markov Chains and Riffle Shuffling Introduction to Markov Chains and Riffle Shuffling Nina Kuklisova Math REU 202 University of Chicago September 27, 202 Abstract In this paper, we introduce Markov Chains and their basic properties, and

More information

2 Hallén s integral equation for the thin wire dipole antenna

2 Hallén s integral equation for the thin wire dipole antenna Ú Ð Ð ÓÒÐ Ò Ø ØØÔ»» Ѻ Ö Ùº º Ö ÁÒغ º ÁÒ Ù ØÖ Ð Å Ø Ñ Ø ÎÓк ÆÓº ¾ ¾¼½½µ ½ ¹½ ¾ ÆÙÑ Ö Ð Ñ Ø Ó ÓÖ Ò ÐÝ Ó Ö Ø ÓÒ ÖÓÑ Ø Ò Û Ö ÔÓÐ ÒØ ÒÒ Ëº À Ø ÑÞ ¹Î ÖÑ ÞÝ Ö Åº Æ Ö¹ÅÓ Êº Ë Þ ¹Ë Ò µ Ô ÖØÑ ÒØ Ó Ð ØÖ Ð Ò Ò

More information

STOCHASTIC PROCESSES Basic notions

STOCHASTIC PROCESSES Basic notions J. Virtamo 38.3143 Queueing Theory / Stochastic processes 1 STOCHASTIC PROCESSES Basic notions Often the systems we consider evolve in time and we are interested in their dynamic behaviour, usually involving

More information

Markov Processes Hamid R. Rabiee

Markov Processes Hamid R. Rabiee Markov Processes Hamid R. Rabiee Overview Markov Property Markov Chains Definition Stationary Property Paths in Markov Chains Classification of States Steady States in MCs. 2 Markov Property A discrete

More information

A Note on the Glauber Dynamics for Sampling Independent Sets

A Note on the Glauber Dynamics for Sampling Independent Sets A Note on the Glauber Dynamics for Sampling Independent Sets Eric Vigoda Division of Informatics King s Buildings University of Edinburgh Edinburgh EH9 3JZ vigoda@dcs.ed.ac.uk Submitted: September 25,

More information

Probability & Computing

Probability & Computing Probability & Computing Stochastic Process time t {X t t 2 T } state space Ω X t 2 state x 2 discrete time: T is countable T = {0,, 2,...} discrete space: Ω is finite or countably infinite X 0,X,X 2,...

More information

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505 INTRODUCTION TO MCMC AND PAGERANK Eric Vigoda Georgia Tech Lecture for CS 6505 1 MARKOV CHAIN BASICS 2 ERGODICITY 3 WHAT IS THE STATIONARY DISTRIBUTION? 4 PAGERANK 5 MIXING TIME 6 PREVIEW OF FURTHER TOPICS

More information

CONVEX OPTIMIZATION OVER POSITIVE POLYNOMIALS AND FILTER DESIGN. Y. Genin, Y. Hachez, Yu. Nesterov, P. Van Dooren

CONVEX OPTIMIZATION OVER POSITIVE POLYNOMIALS AND FILTER DESIGN. Y. Genin, Y. Hachez, Yu. Nesterov, P. Van Dooren CONVEX OPTIMIZATION OVER POSITIVE POLYNOMIALS AND FILTER DESIGN Y. Genin, Y. Hachez, Yu. Nesterov, P. Van Dooren CESAME, Université catholique de Louvain Bâtiment Euler, Avenue G. Lemaître 4-6 B-1348 Louvain-la-Neuve,

More information

j j ( ϕ j ) p (dd c ϕ j ) n < (dd c ϕ j ) n <.

j j ( ϕ j ) p (dd c ϕ j ) n < (dd c ϕ j ) n <. ÆÆÄË ÈÇÄÇÆÁÁ ÅÌÀÅÌÁÁ ½º¾ ¾¼¼µ ÓÒÖÒÒ Ø ÒÖÝ Ð E p ÓÖ 0 < p < 1 Ý ÈÖ ËÙÒ ÚÐе Ê ÞÝ ÃÖÛµ Ò È º Ñ ÀÓÒ À º Ô ÀÒÓµ ØÖغ Ì ÒÖÝ Ð E p ØÙ ÓÖ 0 < p < 1º ÖØÖÞØÓÒ Ó Ö¹ ØÒ ÓÙÒ ÔÐÙÖ ÙÖÑÓÒ ÙÒØÓÒ Ò ØÖÑ Ó F p Ò Ø ÔÐÙÖÓÑÔÐÜ

More information

Ch5. Markov Chain Monte Carlo

Ch5. Markov Chain Monte Carlo ST4231, Semester I, 2003-2004 Ch5. Markov Chain Monte Carlo In general, it is very difficult to simulate the value of a random vector X whose component random variables are dependent. In this chapter we

More information

. Find E(V ) and var(v ).

. Find E(V ) and var(v ). Math 6382/6383: Probability Models and Mathematical Statistics Sample Preliminary Exam Questions 1. A person tosses a fair coin until she obtains 2 heads in a row. She then tosses a fair die the same number

More information

«Û +(2 )Û, the total charge of the EH-pair is at most «Û +(2 )Û +(1+ )Û ¼, and thus the charging ratio is at most

«Û +(2 )Û, the total charge of the EH-pair is at most «Û +(2 )Û +(1+ )Û ¼, and thus the charging ratio is at most ÁÑÔÖÓÚ ÇÒÐÒ ÐÓÖØÑ ÓÖ Ù«Ö ÅÒÑÒØ Ò ÉÓË ËÛØ ÅÖ ÖÓ ÏÓ ÂÛÓÖ ÂÖ ËÐÐ Ý ÌÓÑ ÌÝ Ý ØÖØ We consider the following buffer management problem arising in QoS networks: packets with specified weights and deadlines arrive

More information

Lecture 6: September 22

Lecture 6: September 22 CS294 Markov Chain Monte Carlo: Foundations & Applications Fall 2009 Lecture 6: September 22 Lecturer: Prof. Alistair Sinclair Scribes: Alistair Sinclair Disclaimer: These notes have not been subjected

More information

Lecture 8: Path Technology

Lecture 8: Path Technology Counting and Sampling Fall 07 Lecture 8: Path Technology Lecturer: Shayan Oveis Gharan October 0 Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.

More information

The Monte Carlo Method

The Monte Carlo Method The Monte Carlo Method Example: estimate the value of π. Choose X and Y independently and uniformly at random in [0, 1]. Let Pr(Z = 1) = π 4. 4E[Z] = π. { 1 if X Z = 2 + Y 2 1, 0 otherwise, Let Z 1,...,

More information

1 Stat 605. Homework I. Due Feb. 1, 2011

1 Stat 605. Homework I. Due Feb. 1, 2011 The first part is homework which you need to turn in. The second part is exercises that will not be graded, but you need to turn it in together with the take-home final exam. 1 Stat 605. Homework I. Due

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours

More information

Essentials on the Analysis of Randomized Algorithms

Essentials on the Analysis of Randomized Algorithms Essentials on the Analysis of Randomized Algorithms Dimitris Diochnos Feb 0, 2009 Abstract These notes were written with Monte Carlo algorithms primarily in mind. Topics covered are basic (discrete) random

More information

Markov Chains and MCMC

Markov Chains and MCMC Markov Chains and MCMC CompSci 590.02 Instructor: AshwinMachanavajjhala Lecture 4 : 590.02 Spring 13 1 Recap: Monte Carlo Method If U is a universe of items, and G is a subset satisfying some property,

More information

Simultaneous drift conditions for Adaptive Markov Chain Monte Carlo algorithms

Simultaneous drift conditions for Adaptive Markov Chain Monte Carlo algorithms Simultaneous drift conditions for Adaptive Markov Chain Monte Carlo algorithms Yan Bai Feb 2009; Revised Nov 2009 Abstract In the paper, we mainly study ergodicity of adaptive MCMC algorithms. Assume that

More information

LIMITING PROBABILITY TRANSITION MATRIX OF A CONDENSED FIBONACCI TREE

LIMITING PROBABILITY TRANSITION MATRIX OF A CONDENSED FIBONACCI TREE International Journal of Applied Mathematics Volume 31 No. 18, 41-49 ISSN: 1311-178 (printed version); ISSN: 1314-86 (on-line version) doi: http://dx.doi.org/1.173/ijam.v31i.6 LIMITING PROBABILITY TRANSITION

More information

Powerful tool for sampling from complicated distributions. Many use Markov chains to model events that arise in nature.

Powerful tool for sampling from complicated distributions. Many use Markov chains to model events that arise in nature. Markov Chains Markov chains: 2SAT: Powerful tool for sampling from complicated distributions rely only on local moves to explore state space. Many use Markov chains to model events that arise in nature.

More information

Lecture 2: September 8

Lecture 2: September 8 CS294 Markov Chain Monte Carlo: Foundations & Applications Fall 2009 Lecture 2: September 8 Lecturer: Prof. Alistair Sinclair Scribes: Anand Bhaskar and Anindya De Disclaimer: These notes have not been

More information

Applied Stochastic Processes

Applied Stochastic Processes Applied Stochastic Processes Jochen Geiger last update: July 18, 2007) Contents 1 Discrete Markov chains........................................ 1 1.1 Basic properties and examples................................

More information

Frequency domain representation and singular value decomposition

Frequency domain representation and singular value decomposition EOLSS Contribution 643134 Frequency domain representation and singular value decomposition AC Antoulas Department of Electrical and Computer Engineering Rice University Houston, Texas 77251-1892, USA e-mail:

More information

A D VA N C E D P R O B A B I L - I T Y

A D VA N C E D P R O B A B I L - I T Y A N D R E W T U L L O C H A D VA N C E D P R O B A B I L - I T Y T R I N I T Y C O L L E G E T H E U N I V E R S I T Y O F C A M B R I D G E Contents 1 Conditional Expectation 5 1.1 Discrete Case 6 1.2

More information

Markov Chains and Stochastic Sampling

Markov Chains and Stochastic Sampling Part I Markov Chains and Stochastic Sampling 1 Markov Chains and Random Walks on Graphs 1.1 Structure of Finite Markov Chains We shall only consider Markov chains with a finite, but usually very large,

More information

6.842 Randomness and Computation February 24, Lecture 6

6.842 Randomness and Computation February 24, Lecture 6 6.8 Randomness and Computation February, Lecture 6 Lecturer: Ronitt Rubinfeld Scribe: Mutaamba Maasha Outline Random Walks Markov Chains Stationary Distributions Hitting, Cover, Commute times Markov Chains

More information

1 Random Walks and Electrical Networks

1 Random Walks and Electrical Networks CME 305: Discrete Mathematics and Algorithms Random Walks and Electrical Networks Random walks are widely used tools in algorithm design and probabilistic analysis and they have numerous applications.

More information

1 Directional Derivatives and Differentiability

1 Directional Derivatives and Differentiability Wednesday, January 18, 2012 1 Directional Derivatives and Differentiability Let E R N, let f : E R and let x 0 E. Given a direction v R N, let L be the line through x 0 in the direction v, that is, L :=

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

Randomized Simultaneous Messages: Solution of a Problem of Yao in Communication Complexity

Randomized Simultaneous Messages: Solution of a Problem of Yao in Communication Complexity Randomized Simultaneous Messages: Solution of a Problem of Yao in Communication Complexity László Babai Peter G. Kimmel Department of Computer Science The University of Chicago 1100 East 58th Street Chicago,

More information

Probability and Measure

Probability and Measure Part II Year 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2018 84 Paper 4, Section II 26J Let (X, A) be a measurable space. Let T : X X be a measurable map, and µ a probability

More information

Lecture 5. If we interpret the index n 0 as time, then a Markov chain simply requires that the future depends only on the present and not on the past.

Lecture 5. If we interpret the index n 0 as time, then a Markov chain simply requires that the future depends only on the present and not on the past. 1 Markov chain: definition Lecture 5 Definition 1.1 Markov chain] A sequence of random variables (X n ) n 0 taking values in a measurable state space (S, S) is called a (discrete time) Markov chain, if

More information

MARKOV CHAIN MONTE CARLO

MARKOV CHAIN MONTE CARLO MARKOV CHAIN MONTE CARLO RYAN WANG Abstract. This paper gives a brief introduction to Markov Chain Monte Carlo methods, which offer a general framework for calculating difficult integrals. We start with

More information

Bichain graphs: geometric model and universal graphs

Bichain graphs: geometric model and universal graphs Bichain graphs: geometric model and universal graphs Robert Brignall a,1, Vadim V. Lozin b,, Juraj Stacho b, a Department of Mathematics and Statistics, The Open University, Milton Keynes MK7 6AA, United

More information

Random walks, Markov chains, and how to analyse them

Random walks, Markov chains, and how to analyse them Chapter 12 Random walks, Markov chains, and how to analyse them Today we study random walks on graphs. When the graph is allowed to be directed and weighted, such a walk is also called a Markov Chain.

More information